In this tutorial, we will analyze single-cell RNA sequencing data
from Ho et al.,
Genome Research, 2018. The data are derived from the 451Lu melanoma cell
line in two conditions: 1) parental (untreated) and 2)
resistant (treated for 6 weeks with BRAF inhibitors yet still
proliferating, hence called resistant).
Importantly, the differences between these two cell populations are 1) the exposure to treatment (none versus chronic) and 2) the resistance that emerges from the treatment. Hence these two variables (treatment and resistance) are confounded by the experimental set-up. This tutorial is built on the Seurat package; you can refer to its vignettes for more details about the functions.
Data required for this tutorial can be downloaded from Zenodo: https://zenodo.org/records/11088511
The analysis of alternative 3′ UTR usage is based on a pipeline previously published (Andreassi et al., 2021) and further improved by Lisa Fournier (TA).
apa_data <- read.csv("./APA_table.csv")
#Data:
# apa_data : 11232 x 9 table; the columns are:
# gene: gene name
# proximal_id: the name of the proximal 3' UTR isoform of the pair (Ip)
# proximal_length: length (nt) of the proximal 3' UTR isoform (Ip)
# proximal_occurence_parental: parental pseudo-bulk count for the proximal isoform (sum over all parental cells)
# proximal_occurence_resistant: resistant pseudo-bulk count for the proximal isoform (sum over all resistant cells)
# distal_id: the name of the distal 3' UTR isoform of the pair (Id)
# distal_length: length (nt) of the distal 3' UTR isoform (Id)
# distal_occurence_parental: parental pseudo-bulk count for the distal isoform (sum over all parental cells)
# distal_occurence_resistant :resistant pseudo-bulk count for the distal isoform (sum over all resistant cells)
mycols <- c("#6699FF", "#CC33CC")

# Calculate PUD and RUD for each condition
apa_data <- apa_data %>%
  mutate(
    pud_parental  = proximal_occurence_parental / (proximal_occurence_parental + distal_occurence_parental),
    pud_resistant = proximal_occurence_resistant / (proximal_occurence_resistant + distal_occurence_resistant),
    rud_parental  = log2(proximal_occurence_parental / distal_occurence_parental),
    rud_resistant = log2(proximal_occurence_resistant / distal_occurence_resistant)
  )
# Drop the rows with infinite RUD values for the plots
apa_data <- apa_data[is.finite(apa_data$rud_parental), ]
apa_data <- apa_data[is.finite(apa_data$rud_resistant), ]
# plot the PUD distributions
layout(matrix(ncol=3,nrow=2,c(1,1,3,2,2,4),byrow = TRUE))
multidensity(list(parental=apa_data$pud_parental,resistant=apa_data$pud_resistant),col=mycols, las=1, main="PUD", xlab="PUD")
grid()
multidensity(list(parental=apa_data$rud_parental,resistant=apa_data$rud_resistant),col=mycols, las=1, main="RUD", xlab="RUD")
grid()
boxplot(apa_data$pud_parental, apa_data$pud_resistant, col=mycols, names=c("parental", "resistant"),las=1,frame=FALSE)
result <- wilcox.test(apa_data$pud_parental, apa_data$pud_resistant)
title(paste("P=", format(result$p.value, scientific=TRUE, digits=2)))
boxplot(apa_data$rud_parental, apa_data$rud_resistant, col=mycols, names=c("parental", "resistant"), outline=FALSE,las=1,frame=FALSE)
result <- t.test(apa_data$rud_parental, apa_data$rud_resistant)
title(paste("P=", format(result$p.value, scientific=TRUE, digits=2)))

df <- data.frame(type=c(rep("parental", nrow(apa_data)), rep("resistant", nrow(apa_data))),
                 pud=c(apa_data$pud_parental, apa_data$pud_resistant),
                 rud=c(apa_data$rud_parental, apa_data$rud_resistant))

The PUD distributions are bimodal. Because of this non-normality, we compare them using a non-parametric test, the Wilcoxon rank-sum test. The RUD values, in contrast, are approximately normally distributed, so we compare them with Student's t-test.
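These distributional claims can be sanity-checked before choosing a test. The sketch below uses simulated bimodal and normal samples so that it runs stand-alone; on the real data one would pass the `pud_*` and `rud_*` columns of `apa_data` instead.

```r
# Sanity-check the normality assumption before choosing a test.
# Simulated stand-ins are used here so the chunk runs on its own;
# on the real data, pass apa_data$pud_* / apa_data$rud_* instead.
set.seed(1)
pud_like <- c(rnorm(500, mean = 0.1, sd = 0.05),   # bimodal, like the PUD values
              rnorm(500, mean = 0.9, sd = 0.05))
rud_like <- rnorm(1000)                            # roughly normal, like the RUD values

# Shapiro-Wilk strongly rejects normality for the bimodal sample,
# motivating the rank-based Wilcoxon test for PUD; the approximately
# normal RUD-like data can be compared with a t-test.
shapiro.test(pud_like)$p.value   # far below 0.05
shapiro.test(rud_like)$p.value
```

Note that `shapiro.test()` is limited to 5000 values; on the full table one would subsample first.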
We can now identify the genes exhibiting changes in 3′ UTR usage by comparing their PUD between resistant and parental cells. P-values are obtained using Fisher's exact test on the raw read counts, and changes are called when \(|\Delta PUD| \ge 0.15\), \(|\Delta RUD| \ge 1\), and FDR \(< 0.01\), matching the thresholds used in the code below.
#Calculate dPUD and dRUD
apa_data <- apa_data %>%
mutate(dpud = pud_resistant - pud_parental )
apa_data <- apa_data %>%
mutate(drud = rud_resistant - rud_parental )
#Counts to compute the fisher tests
df_for_fisher = round(apa_data[, c("proximal_occurence_parental", "proximal_occurence_resistant", "distal_occurence_parental", "distal_occurence_resistant")])
# Apply Fisher's Test to each row
results <- apply(df_for_fisher, 1, function(x) {
# Build the 2x2 contingency table: rows = parental/resistant, columns = proximal/distal
mat <- matrix(as.numeric(x), nrow = 2)
# Perform Fisher's Test
fisher.test(mat)$p.value
})
apa_data$p_value <- results
apa_data$fdr <- p.adjust(apa_data$p_value,method="fdr")
# Selection of differential 3' UTR usage
is_sig <- apa_data$fdr < 0.01
dist <- (apa_data$dpud <= (-0.15) & apa_data$drud <= (-1))
short <- (apa_data$dpud >= (0.15) & apa_data$drud >= 1)
sel_distal <- is_sig & dist
sel_proximal <- is_sig & short

# Make a nice plot
coldiff <- c("grey",rgb(0,0,0,0.15))
colsA <- c("#81A4D6","#2D598E","#083872")
colsB <- c("#AE73B1","#79387C","#57055B")
plot(apa_data[,c("pud_parental","pud_resistant")],pch=19,col=rgb(0,0,0,0.1),cex=0.3,las=1,frame=FALSE,xlab="Ip/(Ip+Id) [parental]",ylab="Ip/(Ip+Id) [resistant]")
points(apa_data[sel_proximal,c("pud_parental","pud_resistant")],pch=19,col=colsA[1],cex=0.3)
text(x=0.6,y=1.0,labels=paste(sum(sel_proximal),"distal-to-proximal shifts in resistant"),cex=0.6,col=colsA[1])
points(apa_data[sel_distal,c("pud_parental","pud_resistant")],pch=19,col=colsB[1],cex=0.3)
text(x=0.6,y=0.0,labels=paste(sum(sel_distal),"proximal-to-distal shifts in resistant"),cex=0.6,col=colsB[1])

For details on how these data were produced, please refer to the exercises of week 9. You are free to test the different normalisation methods, which are all included in the R object.
Prior to unsupervised clustering, dimensionality reduction is often applied to the most informative genes/features. As in week 9, we can use different methods for this task.
Selection of the top 1000 variable genes.
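The selection call itself is not shown; a minimal sketch using Seurat (assuming the log-normalised object is called `GE_log`, as in the chunks below, and the default `vst` method):

```r
# Keep the 1000 most variable genes (vst is Seurat's default selection method);
# these are the features later retrieved by VariableFeatures()
GE_log <- FindVariableFeatures(GE_log, selection.method = "vst", nfeatures = 1000)
head(VariableFeatures(GE_log))
```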
Prior to performing the PCA, we first need to scale and center the genes.
layout(matrix(ncol=3,nrow=2,c(1:6),byrow = TRUE))
# Scale and Center gene features
GE_log <- ScaleData(GE_log, features =rownames(GE_log))
#Run PCA
GE_log <- RunPCA(GE_log, features = VariableFeatures(object = GE_log))
#Run t-SNE
GE_log <- RunTSNE(GE_log, dims = 1:10,perplexity=10)
#Run UMAP
GE_log <- RunUMAP(GE_log, dims = 1:10, n.neighbors=5, min.dist=0.4, metric="manhattan")

PCA plot:
pt<-DimPlot(GE_log, reduction = "pca", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("PCA")
pt

t-SNE plot:
pt<-DimPlot(GE_log, reduction = "tsne", group.by = "treatment", pt.size=1, cols = mycols,dims = c(1, 2),alpha=0.2) +ggtitle("t-SNE")
pt

UMAP plot:

pt<-DimPlot(GE_log, reduction = "umap", group.by = "treatment", pt.size=1, dims = c(1, 2), cols = mycols,alpha=0.2)+ggtitle("UMAP")
pt

scTransform is one of the most popular methods for count-based modeling. It transforms the counts using the Pearson residuals of a probabilistic model: PCA is applied to the matrix of Pearson residuals obtained by fitting negative binomial GLMs to the count matrix \(Y\). Specifically, for each row \(i\), scTransform fits the following model to the data \(Y_{i1},\dots,Y_{iJ}\): \(Y_{ij}\sim NegBinom(\mu_{ij},\alpha_i)\) with \(\log \mu_{ij}=\beta_{0i}+\beta_{1i}\log(S_j)\). However, even scTransform can introduce undesirable biases that mask true signals and upweight technical artifacts.
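In Seurat, this model is exposed through the `SCTransform()` function; a minimal usage sketch (assuming a Seurat object `GE` holding raw counts — not run in this tutorial):

```r
# Fit the regularized negative binomial model; Pearson residuals are stored
# in a new "SCT" assay, which becomes the default assay
GE <- SCTransform(GE, verbose = FALSE)
# PCA is then run on the residuals
GE <- RunPCA(GE, verbose = FALSE)
```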
It is known that fundamental issues can arise from the commonly used approach of running PCA on \(\log(1+x)\)-transformed scRNA-seq data. For instance, Townes et al. (2019) showed that the first principal component is strongly correlated with the number of zeros per cell, even on null data. Methods using count models have been developed to address these issues; however, the current leading approaches still have significant limitations.
While most dimensionality reduction approaches apply a transformation to the count matrix followed by principal component analysis (PCA) or a similar method, such approaches can induce spurious heterogeneity and mask true biological variability. An alternative is to model the counts directly: some methods perform dimensionality reduction using a probabilistic model of the count data matrix. These methods can avoid the artifactual biases of simple transformations and, further, can provide principled uncertainty quantification for downstream analyses and visualization. One example is GLM-PCA, which models the entries of the count matrix using a Poisson or negative binomial distribution and estimates latent factors in log space. However, GLM-PCA suffers from slow runtime and convergence issues on single-cell datasets with millions of cells.
scGBM is a novel method for model-based dimensionality reduction of scRNA-seq data using a Poisson bilinear model (you can access the preprint here). It starts from the same underlying model as GLM-PCA but introduces a fast estimation algorithm based on iteratively reweighted singular value decompositions. Furthermore, scGBM quantifies the uncertainty in each cell's latent position and leverages these uncertainties to assess the confidence associated with a given cell clustering.
if (!require("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
if (!require("scGBM", quietly = TRUE))
devtools::install_github("phillipnicol/scGBM")
library("scGBM")
#https://github.com/phillipnicol/scGBM
set.seed(1126490984)
#Run scGBM with latent factors
out <- scGBM::gbm.sc(as.matrix(GE@assays$RNA$counts),M=10)# M is the number of latent factors
#out.proj <- scGBM::gbm.sc(as.matrix(GE@assays$RNA$counts),M=2,subset=100,ncores=8) ; #
colnames(out$scores) <- 1:10
GE[["gbm"]] <- CreateDimReducObject(embeddings=out$scores,key="GBM_")
save(list=c("out","GE"),file="./results_scGBM.RData")

pt <- DimPlot(GE, reduction = "gbm", group.by = "treatment", pt.size=1, cols = mycols, dims = c(1, 2))+ggtitle("GBM")
pt

Seurat applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). First, a KNN graph based on the Euclidean distance in PCA space is constructed, and the edge weights between any two cells are refined based on the shared overlap of their local neighborhoods (Jaccard similarity). This step is performed by the FindNeighbors() function, which takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, modularity optimization techniques such as the Louvain algorithm (default; Blondel et al., Journal of Statistical Mechanics) or SLM are applied to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure and contains a resolution parameter that sets the "granularity" of the downstream clustering, with increased values leading to a greater number of clusters. The authors of Seurat found that setting this parameter between 0.4 and 1.2 typically returns good results for single-cell datasets of around 3K cells; the optimal resolution often increases for larger datasets.
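The two steps described above are not shown as code in this document; a minimal sketch consistent with the printed output follows (the resolution value of 0.5 is an assumption):

```r
# Build the KNN graph on the first 10 PCs; edge weights are refined
# with the Jaccard similarity of local neighborhoods
GE_log <- FindNeighbors(GE_log, dims = 1:10)
# Louvain community detection; resolution = 0.5 is an assumed value
GE_log <- FindClusters(GE_log, resolution = 0.5)
```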
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6414
## Number of edges: 207228
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9602
## Number of communities: 3
## Elapsed time: 0 seconds
It is now possible to extract all cells originating from one cluster:
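For example (a sketch; this assumes the cluster identities are the active Idents after FindClusters()):

```r
# Barcodes of all cells assigned to cluster 0
cluster0_cells <- WhichCells(GE_log, idents = 0)
# Or subset the full Seurat object to that cluster
GE_cluster0 <- subset(GE_log, idents = 0)
```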
Here the goal is to find markers that define clusters via differential expression (DE). You can either identify positive and negative markers of a single cluster compared to all other cells, or test groups of clusters vs. each other or against all cells.
# find markers for every cluster compared to all remaining cells, report only the positive ones
# This cell takes ~15-20 minutes to run
GE_log.markers <- FindAllMarkers(GE_log, only.pos = TRUE)
GE_log.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1)

## # A tibble: 1,754 × 7
## # Groups: cluster [3]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 0 3.31 0.957 0.257 0 0 BCAS3
## 2 0 4.78 0.608 0.021 0 0 MDK
## 3 0 4.43 0.58 0.019 0 0 RAC2
## 4 0 6.92 0.558 0.003 0 0 SLC17A9
## 5 0 3.42 0.618 0.096 0 0 UNC13D
## 6 0 2.21 0.727 0.206 0 0 S100A1
## 7 0 2.06 0.91 0.399 0 0 RAMP1
## 8 0 2.16 0.787 0.323 0 0 DCBLD2
## 9 0 2.30 0.955 0.494 0 0 CAPG
## 10 0 2.11 0.806 0.349 0 0 KCNAB2
## # ℹ 1,744 more rows
GE_log.markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 20) %>%
ungroup() -> top8
DoHeatmap(GE_log, features = top8$gene) + NoLegend()

We can perform gene enrichment analysis on the top 300 up-regulated genes in cluster 2:
# print the top marker genes for cluster 2
gs<-head(GE_log.markers[GE_log.markers$cluster == 2,], n=300)$gene
gostres_diff <- gost(query =gs,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG","REAC"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)

In this task, use ScType for cell type annotation (see the documentation here: https://github.com/IanevskiAleksandr/sc-type). The expected output is a plot of the clusters with cell type annotations. You are free to use any of the dimensionality reduction techniques tested above, and to choose which tissue to test.
#https://github.com/IanevskiAleksandr/sc-type
# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
# load wrapper function
source("https://raw.githubusercontent.com/kris-nader/sc-type/master/R/sctype_wrapper.R");
# you are free to load any other script that you find relevant or that you want to test
library("HGNChelper")
### ADD YOUR CODE HERE ###
# DB file
db_ <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
tissue <- "Immune system" # e.g. Immune system,Pancreas,Liver,Eye,Kidney,Brain,Lung,Adrenal,Heart,Intestine,Muscle,Placenta,Spleen,Stomach,Thymus
# prepare gene sets
gs_list <- gene_sets_prepare(db_, tissue)
# check Seurat object version (scRNA-seq matrix extracted differently in Seurat v4/v5)
seurat_package_v5 <- isFALSE('counts' %in% names(attributes(GE_log[["RNA"]])));
print(sprintf("Seurat object %s is used", ifelse(seurat_package_v5, "v5", "v4")))

## [1] "Seurat object v5 is used"
# extract scaled scRNA-seq matrix
scRNAseqData_scaled <- if (seurat_package_v5) as.matrix(GE_log[["RNA"]]$scale.data) else as.matrix(GE_log[["RNA"]]@scale.data)
# run ScType
es.max <- sctype_score(scRNAseqData = scRNAseqData_scaled, scaled = TRUE, gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)
# NOTE: scRNAseqData parameter should correspond to your input scRNA-seq matrix. For raw (unscaled) count matrix set scaled = FALSE
# When using Seurat, we use "RNA" slot with 'scale.data' by default. Please change "RNA" to "SCT" for sctransform-normalized data,
# or to "integrated" for joint dataset analysis. To apply sctype with unscaled data, use e.g. pbmc[["RNA"]]$counts or pbmc[["RNA"]]@counts, with scaled set to FALSE.
# merge by cluster
cL_results <- do.call("rbind", lapply(unique(GE_log@meta.data$seurat_clusters), function(cl){
es.max.cl = sort(rowSums(es.max[ ,rownames(GE_log@meta.data[GE_log@meta.data$seurat_clusters==cl, ])]), decreasing = TRUE)
head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(GE_log@meta.data$seurat_clusters==cl)), 10)
}))
sctype_scores <- cL_results %>% group_by(cluster) %>% top_n(n = 1, wt = scores)
# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] <- "Unknown"
print(sctype_scores[,1:3])

## # A tibble: 3 × 3
## # Groups: cluster [3]
## cluster type scores
## <fct> <chr> <dbl>
## 1 0 Platelets 891.
## 2 2 Unknown 3.94
## 3 1 Unknown 342.
GE_log@meta.data$sctype_classification = ""
for(j in unique(sctype_scores$cluster)){
cl_type = sctype_scores[sctype_scores$cluster==j,];
GE_log@meta.data$sctype_classification[GE_log@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}
DimPlot(GE_log, reduction = "umap", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

### Idea of what the output could be:
#DimPlot(GE_log, reduction = "tsne", label = TRUE, repel = TRUE, group.by = 'sctype_classification')

As we already know, melanoma is a type of skin cancer that begins in pigment-producing cells called melanocytes. Melanoma usually occurs on the skin, but in about 5% of cases it develops in melanocytes in other tissues, including the eyes or the mucous membranes that line body cavities, such as the moist lining of the mouth. If a melanoma thickens and involves multiple layers of the skin, it can spread to other parts of the body (through the lymphatic system).
Therefore, when selecting tissues for cell type annotation, we opted for Immune system, Eyes and Muscle (since Skin was not available). Clustering for the Immune system yielded two annotated groups: Platelets and Unknown. The Unknown cluster likely reflects the fact that the ScType database covers only a limited number of cell types. Clustering for the Eyes yielded three groups: Astrocytes, Glycinergic amacrine cells and Unknown. Clustering for Muscle yielded three groups: Schwann cells, Smooth muscle cells and Unknown. The conclusion we can draw is that this tool can provide some guidance for cell type annotation, but not enough information, since its database lacks key tissues and a significant number of cell types.
Test another cell type annotation method. You can choose among the methods below (scGPT, deCS and scAnnotate) or propose an alternative method.
#https://github.com/Winnie09/GPTCelltype
#https://www.nature.com/articles/s41592-024-02235-4
#install.packages("openai")
#remotes::install_github("Winnie09/GPTCelltype")
library("openai")
library("GPTCelltype")
#You need to add a payment plan to get there
#Sys.setenv(OPENAI_API_KEY = 'put your key here')

#https://github.com/bsml320/deCS
#https://academic.oup.com/gpb/article/21/2/370/7585489?login=false
library(deCS)
# BlueprintEncode_main_t_score
data(BlueprintEncode_main)
# BlueprintEncode_fine_t_score
data(BlueprintEncode_fine)
# DICE_main_t_score
data(DICE_main)
# DICE_fine_t_score
data(DICE_fine)
# MonacoImmune_main_t_score
data(MonacoImmune_main)
# MonacoImmune_fine_t_score
data(MonacoImmune_fine)
# Human_cell_landscape (HCL_z_score)
data(Human_cell_landscape)
# Human_cell_atlas_of_fetal (HCAF_z_score)
data(Human_cell_atlas_of_fetal)
# Cell type marker gene list from CellMatch database
data(CellMatch)
GE_log_top8_markers_list = GE_log.markers[which(GE_log.markers$gene %in% top8$gene),]
GE_log_cluster_average = log(AverageExpression(GE_log)[[1]] + 1, 2)
GE_log_cluster_marker_average = GE_log_cluster_average[which(rownames(GE_log_cluster_average) %in% top8$gene), ]
GE_log_cluster_marker_z_score = t(scale(t(GE_log_cluster_marker_average)))
pbmc_deCS_cor_panel_A <- deCS.correlation(GE_log_cluster_marker_z_score, BlueprintEncode_main_t_score)

## Query Max_correlation Top1 Top2 Top3 Cell_labels
## 1 g0 0.3972951 Eosinophils Pericytes Neutrophils Eosinophils
## 2 g1 0.2535657 Melanocytes Adipocytes Keratinocytes Melanocytes
## 3 g2 0.6417357 HSC Erythrocytes Neurons HSC
Using another technique for cell type annotation gives more results. Since more reference databases are available, we can compare the reference panels against different cell types. If we select the Blueprint Encode database, which contains 259 RNA-seq samples of pure stroma and immune cells, we see that the maximum correlation is achieved for eosinophils (a type of white blood cell of the immune system) and HSCs (stem cells that give rise to other blood cells). Among the top 2 and top 3 most correlated cell types we also observe adipocytes, which we already identified above. The key point to take from these two approaches is that both can give many insights into cell type annotation based on reference panels. Even though there are variations between the results (which come from different databases and methodologies), they point in the same direction: Immune system, Skin, Eyes. Therefore, we should use these tools to collect as much useful information as possible.
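Since the other reference panels are already loaded, the same query can be repeated against them; for instance, against the Human Cell Landscape (a sketch; the object name `HCL_z_score` follows the comment in the data-loading chunk above):

```r
# Repeat the correlation-based annotation against the Human Cell Landscape panel
deCS_cor_HCL <- deCS.correlation(GE_log_cluster_marker_z_score, HCL_z_score)
```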